Abstract — Probabilistic generative modeling of data distributions
can exploit hidden information that is useful for discriminative
classification. This observation has motivated approaches that couple
generative and discriminative models for classification. In this paper,
we propose a new approach that couples generative and discriminative
models in a unified framework based on PAC-Bayes risk theory. We first
derive a model-parameter-independent stochastic feature mapping from a
practical MAP classifier operating on generative models. We then
construct a linear stochastic classifier equipped with this feature
mapping, and derive explicit PAC-Bayes risk bounds for such a classifier
for both supervised and semi-supervised learning. Minimizing the risk
bound with an EM-like iterative procedure yields a new posterior over
the hidden variables (E-step) and the update rules of the model
parameters (M-step). The derivation of the posterior is always feasible
because of the way the feature mapping is equipped and the explicit form
of the risk bound. The derived posterior allows the generative models,
and subsequently the feature mappings, to be tuned for better
classification. The derived update rules of the model parameters are the
same as those of the uncoupled models, because the feature mapping is
model-parameter-independent. Our experiments show that coupling the
generative data model and the discriminative classifier via a stochastic
feature mapping in this framework leads to a general classification tool
with state-of-the-art performance.

Index Terms — stochastic feature mapping; PAC-Bayes risk 
bound; hybrid generative-discriminative classification 

I. Introduction 

Discriminative models, designed to find decision boundaries among
different classes, are state-of-the-art tools for classification, while
probabilistic generative models, which seek to model data distributions,
are adept at exploiting hidden information, dealing with structured data
(e.g. protein sequences of variable length) and solving nonlinear
classification problems with the maximum a posteriori (MAP) classifier.
The complementarities of the two paradigms have been investigated
1 19l . lUl, resulting in several promising works 10, iBTI, [23], [9].
The following observations have emerged from these works in the context
of classification: (1) generative models provide feature mappings that
simultaneously exploit hidden information and transform structured data
into fixed-dimensional features; (2) discriminative models find an
optimal decision boundary in such a feature space under a specified
criterion.

Generative score space methods [3], [9], [14] are motivated by the above
observations. These methods derive feature mappings from the log
likelihood (or its lower bound) of generative models. These feature
mappings are measures over models $P(x, h, \theta)$, taking the form
$E_{P(h \mid x)}[\phi(x, h, \theta)]$, where $\phi$ is a function of the
observed variable $x$ and the hidden variable set $h$. They map observed
and hidden variables into a vector of scores, which is then used as
features by classifiers. These methods exploit the superior ability of
generative models to exploit hidden information and deal with structured
data. However, in these methods, the generative models are isolated from
the classification process, and there is no principled way to tune the
generative models or the feature mapping to improve classification. It
is desirable to develop a mechanism that couples the classifier to the
generative models to allow fine-tuning of the feature mapping.

Maximum entropy discrimination [19] provides yet another framework to
exploit generative models for classification under the large margin
principle. This framework, however, requires deliberately choosing
conjugate priors for the parameters of the generative models, which
limits its application to complex models. In addition, the VC risk bound
[11] utilized by this method is generally loose in comparison with the
PAC-Bayes bounds [13], Q, lIU. There are also other efforts [21], [23]
to couple generative and discriminative models for classification.
However, these methods provide no explicit feature mapping, which is
useful in real applications. Further, they require re-formulating the
update rules of the parameters of the generative models, which is
typically complex.

This paper proposes an approach based on PAC-Bayes theory [13], [5], [2]
to integrate the complementary strengths of generative and
discriminative models. Using the linear form of a practical MAP
classifier operating on generative models, we derive a
model-parameter-independent stochastic feature mapping. By stochastic
feature mapping, we mean that the feature used for classification is a
function of the input data and the hidden variables of the generative
models. This is distinct from the current methods [3], [9], [14], which
map a data point to a feature deterministically. We then construct a
stochastic classifier, a Gibbs classifier operating on the derived
feature mapping, and derive explicit PAC-Bayes risk bounds for such a
classifier. By minimizing the risk bound with an EM-like iterative
procedure, we derive the posterior over the hidden variables (E-step)
and the update rules of the model parameters (M-step). The derivation is
always feasible because of the way the feature mapping is equipped and
the form of the risk bound. The posterior provides a bridge that allows
the classifier to tune the generative models and subsequently the
feature mapping for classification. The update rules of the model
parameters are quite simple: they are essentially the same as those of
the uncoupled models, because the feature mapping is
model-parameter-independent.

II. From MAP classifier to Stochastic Feature Mapping

In this section, for exponential family generative models, we derive the
linear form (Eq. (4)) of the MAP classifier (Eq. (3)) based on the
variational approximation (Eq. (2)), and use it to derive a stochastic
feature mapping. The derived feature mapping functions similarly to
those of [3], [9], [14]. Consider the binary classification problem that
assigns labels $y \in \{-1, +1\}$ to examples $x \in \mathbb{R}^d$. Let
$P(x \mid \theta_y)$ be the class-conditional distributions over $x$,
and $P(y)$ be the prior over labels. The decision rule of the MAP
classifier is $\hat{y} = \arg\max_y P(x \mid \theta_y) P(y)$, which is
equivalent to $\hat{y} = \mathrm{sign}(L(x; \Theta))$, where
$\mathrm{sign}(a) = +1$ if $a > 0$ and $\mathrm{sign}(a) = -1$
otherwise, with the discriminant function:

$$L(x; \Theta) = \log P(x \mid \theta_+) - \log P(x \mid \theta_-) + b \tag{1}$$

where the subscripts $+, -$ are shorthand for $+1, -1$;
$\Theta = \{\theta_-, \theta_+, b\}$; and
$b = \log P(y = +1) - \log P(y = -1)$. When $P(x \mid \theta_y)$ is
modeled by a generative model $P(x, h \mid \theta_y)$ with a set of
random hidden variables $h$, it is difficult to obtain a closed form of
$P(x \mid \theta_y)$, since $\int P(x, h \mid \theta_y)\, dh$ is usually
intractable. We can resort to the following variational lower bound
Q, IS:

$$\log P(x \mid \theta_y) \ge E_{Q(h)}[\log P(x, h) - \log Q(h)] \triangleq F(x, \theta_y) \tag{2}$$

where $Q(h)$ is the variational approximate posterior of $P(h \mid x)$.
Then, instead of the intractable discriminant function (Eq. (1)), we
resort to the following tractable one [19], [9]:

$$L(x; \Theta) = F(x, \theta_+) - F(x, \theta_-) + b \tag{3}$$
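
The lower bound of Eq. (2) and the tractable discriminant of Eq. (3) can be illustrated on a toy model. The sketch below is not the authors' implementation: it assumes a one-dimensional Gaussian mixture whose hidden variable is the component index, so that the exact marginal is available for comparison, and all function names and parameter values are hypothetical.

```python
import math

def log_gaussian(x, mean, std):
    """Log density of a 1-D Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)

def log_joint(x, k, theta):
    """log P(x, h=k | theta) for a mixture given as (weights, means, stds)."""
    weights, means, stds = theta
    return math.log(weights[k]) + log_gaussian(x, means[k], stds[k])

def log_marginal(x, theta):
    """Exact log P(x | theta); tractable here because h takes few values."""
    terms = [log_joint(x, k, theta) for k in range(len(theta[0]))]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def lower_bound(x, q, theta):
    """F(x, theta) = E_Q[log P(x, h) - log Q(h)], as in Eq. (2)."""
    return sum(q_k * (log_joint(x, k, theta) - math.log(q_k))
               for k, q_k in enumerate(q) if q_k > 0.0)

def exact_posterior(x, theta):
    """P(h | x), the choice of Q that makes the bound tight."""
    log_px = log_marginal(x, theta)
    return [math.exp(log_joint(x, k, theta) - log_px) for k in range(len(theta[0]))]

def discriminant(x, q_pos, q_neg, theta_pos, theta_neg, b=0.0):
    """Tractable discriminant of Eq. (3)."""
    return lower_bound(x, q_pos, theta_pos) - lower_bound(x, q_neg, theta_neg) + b

# Illustrative (made-up) class-conditional mixtures.
theta_pos = ([0.5, 0.5], [1.0, 3.0], [1.0, 1.0])
theta_neg = ([0.5, 0.5], [-1.0, -3.0], [1.0, 1.0])
x = 1.5
loose = lower_bound(x, [0.5, 0.5], theta_pos)
tight = lower_bound(x, exact_posterior(x, theta_pos), theta_pos)
score = discriminant(x, exact_posterior(x, theta_pos),
                     exact_posterior(x, theta_neg), theta_pos, theta_neg)
```

With a uniform $Q$ the bound is loose; with the exact posterior it coincides with $\log P(x \mid \theta)$, in which case Eq. (3) reduces to Eq. (1).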



We assume that the generative model $P(x, h \mid \theta)$ belongs to the
exponential family, which covers most models. It has the general form
$P(x, h) = \exp\{a(\theta)^T T(x, h) + S(x, h) + d(\theta)\}$, where
$\theta$ is the vector of parameters; $T(x, h)$ is the vector of
sufficient statistics; $S(x, h)$ and $d(\theta)$ are scalar functions.
Similarly, the prior over $h$ is
$P(h) = \exp\{c(\theta_h)^T T(h) + S(h) + f(\theta_h)\}$. Further, we
assume that the approximate posterior of $h$, for the example $x$, takes
the same form as its prior $P(h)$ but with a different parameter
$\hat\theta$: $Q(h) = \exp\{c(\hat\theta)^T T(h) + S(h) + f(\hat\theta)\}$.
Substituting the above formulas of $P(x, h)$ and $Q(h)$ into Eq. (2), it
can be verified that
$F(x, \theta) = E_{Q(h)}[\log P(x, h) - \log Q(h)] = \alpha^T E_{Q(h)}[\mathbf{f}(x, h)] + \beta$,
where $\alpha = (a(\theta)^T, 1, -\mathbf{1}^T, -1, -1)^T$;
$\beta = d(\theta)$;

$$\mathbf{f}(x, h) = \left(T(x, h)^T,\; S(x, h),\; (\mathrm{diag}(c(\hat\theta))\, T(h))^T,\; S(h),\; f(\hat\theta)\right)^T$$

For a pair of models $\theta_+$ and $\theta_-$, Eq. (3) can be written
as:

$$\begin{aligned}
L(x; \Theta) &= \alpha_+^T E_{Q(h_+)}[\mathbf{f}(x, h_+)] - \alpha_-^T E_{Q(h_-)}[\mathbf{f}(x, h_-)] + \beta_+ - \beta_- + b \\
&= \bar\alpha^T E_{Q(h_+, h_-)}[\phi(x, h_+, h_-)]
\end{aligned} \tag{4}$$

where $\bar\alpha = (\alpha_+^T, -\alpha_-^T, \beta_+ - \beta_- + b)^T$
and $\phi(x, h_+, h_-) = (\mathbf{f}(x, h_+)^T, \mathbf{f}(x, h_-)^T, 1)^T$.
Eq. (4) takes the form of a linear classifier, where
$E_Q[\phi(x, h_+, h_-)]$ is considered to furnish a feature mapping.
From another perspective, $\phi(x, h_+, h_-)$ can be considered a
stochastic feature mapping, because the hidden variables $h_+, h_-$ are
all conditioned on the example $x$ and thus their values can serve as
features for identifying $x$. It is considered to define a stochastic
feature space because it is evaluated on stochastic examples drawn from
the posterior of $h$.

III. PAC-Bayes bound stochastic classifier 

The derived stochastic feature mapping $\phi(x, h_+, h_-)$ makes it
possible to jointly learn the generative models (and subsequently the
feature mappings) and the classifier. We construct a linear Gibbs
classifier over this stochastic feature mapping:

$$G_Q = \mathrm{sign}[w \cdot \phi(x, h_+, h_-)] = f_w(x, h_+, h_-) \tag{5}$$

where $w$ is the weight vector of the classifier; $h_+, h_-, w$ follow a
distribution $Q$ that will be specified in Section III-B. A Gibbs
classifier with such a feature mapping offers several advantages. First,
this classifier admits PAC-Bayes risk bounds that have explicit
solutions for $Q(h_+, h_-)$, which help tune the feature mapping for
better classification. Second, the PAC-Bayes risk bound for such a
classifier can be tighter than VC bounds [11]. Third, the feature
mapping is independent of the model parameters $\theta$, which makes
solving for $\theta$ very simple.

A. PAC-Bayes bounds for stochastic feature mapping 

Let $\mathcal{X}$ be the input space, consisting of an arbitrary subset
of $\mathbb{R}^d$, and $\mathcal{Y} = \{-1, +1\}$ be the output space.
An example is an input-output pair $(x, y)$ where $x \in \mathcal{X}$
and $y \in \mathcal{Y}$. In a PAC-Bayes setting [13], each example
$(x, y)$ is drawn from a fixed but unknown probability distribution $D$
on $\mathcal{X} \times \mathcal{Y}$. Let
$f(x, v): \mathcal{X} \to \mathcal{Y}$ be any classifier with a set of
variables $v$. The learning task is to choose a posterior distribution
$Q$ over a space $\mathcal{F}$ of classifiers and a space $\mathcal{V}$
of variables such that the $Q$-weighted majority classifier
$B_Q = \mathrm{sign}[E_{(f,v) \sim Q} f(x, v)]$ has the smallest
possible risk on the training example set
$S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$. The output of $B_Q$ is closely
related to the output of the Gibbs classifier $G_Q$, which first chooses
a classifier $f$ and a vector $v$ according to $Q$, and then classifies
an example $x$. The true risk $R(G_Q)$ and the empirical risk
$R_S(G_Q)$ of this Gibbs classifier are given by:

$$R(G_Q) = E_{(f,v) \sim Q}\, E_{(x,y) \sim D}\, I(f(x, v) \ne y) \tag{6}$$

$$R_S(G_Q) = E_{(f,v) \sim Q}\, \frac{1}{m} \sum_{i=1}^{m} I(f(x_i, v) \ne y_i) \tag{7}$$



This setting is naturally accommodated by PAC-Bayes theory, since $v$
can be considered a part of $f$. Among several PAC-Bayes bounds
Cni, is], Q, lEl, the bound derived in [2] is quite tight and gives an
explicit bound for the true risk $R(G_Q)$, which allows the derivation
of the posterior $Q$, in contrast to most of the other bounds, which are
implicit bounds over $\mathrm{KL}(R_S(G_Q) \| R(G_Q))$.

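Theorem 1. For any distribution $D$ over $\mathcal{X} \times \mathcal{Y}$,
any space $\mathcal{F}$ of classifiers, any space $\mathcal{V}$ of
random variables, any distribution $P$ over
$\mathcal{F} \times \mathcal{V}$, any $\delta \in (0, 1]$, and any real
number $C > 0$, for all $Q$ over $\mathcal{F} \times \mathcal{V}$, we
have:

$$\Pr\left( R(G_Q) \le \frac{1}{1 - e^{-C}} \left\{ 1 - \exp\left[ -\left( C\, R_S(G_Q) + \frac{1}{m} \left[ \mathrm{KL}(Q \| P) - \ln \delta \right] \right) \right] \right\} \right) \ge 1 - \delta$$

This is a slight extension of Corollary 2.2 of [2], and can be proved by
replacing $f$ with $(f, v)$ and reapplying its proof [2].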

The above risk bound is derived for labeled data. We extend the bound to
accommodate both labeled and unlabeled data for semi-supervised learning
in the following theorem. The semi-supervised bound differs from that of
[12], whose bound is implicit and has no explicit solution for $Q$.
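
Theorem 2. For any distribution $D$ over $\mathcal{X} \times \mathcal{Y}$,
any space $\mathcal{F}$ of classifiers, any space $\mathcal{V}$ of
random variables, any distribution $P$ over
$\mathcal{F} \times \mathcal{V}$, any $\delta \in (0, 1]$, and any real
number $C > 0$, for all $Q$ over $\mathcal{F} \times \mathcal{V}$, we
have:

$$\Pr\left( R(G_Q) \le \frac{1}{1 - e^{-C}} \left\{ 1 - \exp\left[ -\left( C \left( e_S(G_Q) + \frac{1}{2} d_S(G_Q) \right) + \frac{1}{m} \left[ \mathrm{KL}(Q \| P) - \ln \delta \right] \right) \right] \right\} \right) \ge 1 - \delta$$

where the risks for labeled and unlabeled data are
$e_S(G_Q) = E_{(f_1,v_1) \sim Q} E_{(f_2,v_2) \sim Q} \frac{1}{m} \sum_{i=1}^{m} I(f_1(x_i, v_1) \ne y_i)\, I(f_2(x_i, v_2) \ne y_i)$
and
$d_S(G_Q) = E_{(f_1,v_1) \sim Q} E_{(f_2,v_2) \sim Q} \frac{1}{m} \sum_{i=1}^{m} I(f_1(x_i, v_1) \ne f_2(x_i, v_2))$.

Proof: Let $E_f$ be the abbreviation of $E_{(f,v) \sim Q}$. Note that
$E_{f_1,f_2} I(f_1 \ne f_2) = E_{f_1,f_2}\, 2 I(f_1 \ne y) I(f_2 = y) = E_{f_1,f_2}\, 2 I(f_1 \ne y)(1 - I(f_2 \ne y)) = E_{f_1,f_2}\, 2 \left( I(f_1 \ne y) - I(f_1 \ne y) I(f_2 \ne y) \right)$
and $R_S(G_Q) = \frac{1}{m} \sum_i E_{f_1} I(f_1^i \ne y_i)$ (Eq. (7)).
Therefore
$R_S(G_Q) = \frac{1}{m} \sum_i E_{f_1} I(f_1^i \ne y_i) = \frac{1}{m} \sum_i E_{f_1,f_2} I(f_1^i \ne y_i) I(f_2^i \ne y_i) + \frac{1}{2m} \sum_i E_{f_1,f_2} I(f_1^i \ne f_2^i) = e_S(G_Q) + \frac{1}{2} d_S(G_Q)$,
where $f^i = f(x_i)$. Substituting
$R_S(G_Q) = e_S(G_Q) + \frac{1}{2} d_S(G_Q)$ into Theorem 1, we obtain
Theorem 2. ■

Since $d_S(G_Q)$ is independent of labels, it allows classifiers using
the above bound to exploit unlabeled data. Minimizing this risk
$d_S(G_Q)$ would contract the posteriors over the stochastic classifier
and the stochastic feature space, making classification and feature
mapping less uncertain.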
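
The decomposition $R_S(G_Q) = e_S(G_Q) + \frac{1}{2} d_S(G_Q)$ used in the proof can be checked numerically on a small finite example. The sketch below assumes a posterior $Q$ supported on a handful of fixed binary classifiers, represented only by their predictions; the function name and the toy data are illustrative and not part of the paper.

```python
from itertools import product

def risk_decomposition(preds, probs, labels):
    """Return (R_S, e_S, d_S) for a finite posterior over binary classifiers.

    preds[k][i] is the +1/-1 prediction of classifier k on example i,
    probs[k] is its weight under Q, labels[i] is the +1/-1 true label.
    """
    m = len(labels)
    n_clf = len(preds)
    # Empirical Gibbs risk: Q-average of the individual error rates.
    gibbs = sum(probs[k] * sum(preds[k][i] != labels[i] for i in range(m))
                for k in range(n_clf)) / m
    joint_error = 0.0   # e_S: two independent draws from Q both err
    disagreement = 0.0  # d_S: two independent draws from Q disagree
    for k1, k2 in product(range(n_clf), repeat=2):
        weight = probs[k1] * probs[k2]
        for i in range(m):
            joint_error += weight * (preds[k1][i] != labels[i]) * (preds[k2][i] != labels[i])
            disagreement += weight * (preds[k1][i] != preds[k2][i])
    return gibbs, joint_error / m, disagreement / m

labels = [1, -1, 1, 1]
preds = [[1, -1, -1, 1], [1, 1, 1, -1], [-1, -1, 1, 1]]
probs = [0.5, 0.3, 0.2]
gibbs, e_s, d_s = risk_decomposition(preds, probs, labels)
```

Note that `d_s` never touches `labels` except through the shared index, which is why this term can be estimated from unlabeled examples.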

B. Objective function and specification 

Let $B(Q, C) = \frac{1}{1 - e^{-C}} \left[ 1 - \exp\left( -J(Q) + \frac{1}{m} \ln \delta \right) \right]$
be the upper bound in Theorem 2, where
$J(Q) = C \left( e_S(G_Q) + \frac{1}{2} d_S(G_Q) \right) + \frac{1}{m} \mathrm{KL}(Q \| P)$.
Training a classifier with minimum risk means minimizing the upper bound
$B(Q, C)$ w.r.t. $Q$ and $C$. Note that minimizing $B(Q, C)$ w.r.t. $Q$
is equivalent to minimizing $J(Q)$ w.r.t. $Q$. Since unlabeled data are
only usable in estimating $d_S(G_Q)$, $J(Q)$ over labeled data $S_l$ of
$m_l$ examples and unlabeled data $S_u$ of $m_u$ examples can be written
as

$$J(Q) = C \left[ e_{S_l}(G_Q) + \frac{1}{2} d_{S_u}(G_Q) \right] + \frac{1}{m} \mathrm{KL}(Q \| P)$$

where $m = m_l + m_u$. This form enables us to derive the analytical
form of the posterior distributions $Q$ of the classifier and the hidden
variables. Applying the above bound to the stochastic classifier defined
in Eq. (5) and setting $v = (h_+, h_-)$, we have

$$f = f_w, \qquad E_{f \sim Q}[\cdot] = E_{w \sim Q}[\cdot] \tag{8}$$

Then learning the PAC-Bayes bound classifier with the generative model
embedding amounts to minimizing the objective function $J(Q)$ w.r.t. the
posterior $Q(w, h_+, h_-)$ and the parameters $\theta_+, \theta_-$:

$$\begin{aligned}
J(Q) = {} & C \left[ e_{S_l}(G_Q) + \frac{1}{2} d_{S_u}(G_Q) \right] + \frac{1}{m} \Big[ \mathrm{KL}(Q(w) \| P(w)) \\
& + \mathrm{KL}(Q(h_+) \| P(x, h_+ \mid \theta_+)) + \mathrm{KL}(Q(h_-) \| P(x, h_- \mid \theta_-)) \Big]
\end{aligned} \tag{9}$$

where $G_Q$ is the linear stochastic classifier defined in Eq. (5);
$P(x, h_+ \mid \theta_+)$ and $P(x, h_- \mid \theta_-)$ are the
generative models for the positive and negative classes respectively;
the first row is the objective function of a regular Gibbs classifier
and the second row is the objective function of the two generative
models.

To compute the objective function $J(Q)$ in Eq. (9), we need
approximations or expressions for $e_{S_l}$, $d_{S_u}$ and
$\mathrm{KL}(Q \| P)$ that are computationally tractable. To derive
these expressions, as was done in [5], we assume that the prior of the
weight is Gaussian, $P(w) = \mathcal{N}(u_0, I)$, and that its posterior
is also Gaussian but with a different mean, i.e.,
$Q(w) = \mathcal{N}(u, I)$. Based on this assumption, we have:

$$\mathrm{KL}(Q(w) \| P(w)) = \frac{1}{2} \| u - u_0 \|^2 \tag{10}$$

Using the assumption and Gaussian integrals [5], we have

$$E_{w \sim Q}\, I(f_w(x, h_+, h_-) \ne y) = \Phi(y\, u \cdot \phi(x, h_+, h_-)) \tag{11}$$

where $\Phi(a) = \int_a^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) dt = \frac{1}{2} \mathrm{erfc}\left(\frac{a}{\sqrt{2}}\right)$
and $\phi$ is normalized, i.e. $\phi \leftarrow \phi / \|\phi\|$.
Further, considering Eq. (11), we have the integration:

$$\begin{aligned}
E_{w_1, w_2 \sim Q}\, I(f_{w_1} \ne f_{w_2}) &= E_{w_1, w_2 \sim Q}\, 2 I(f_{w_1} \ne 1)\, I(f_{w_2} \ne -1) \\
&= 2\, \Phi(u \cdot \phi(x, h_+, h_-))\, \Phi(-u \cdot \phi(x, h_+, h_-))
\end{aligned} \tag{12}$$

With these formulas, we proceed to obtain an expression for $J(Q)$, and
find $Q(h_+, h_-)$, $Q(w)$, $\theta_+$ and $\theta_-$ by minimizing
$J(Q)$ with an iterative optimization procedure in the next section.
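
The closed forms of Eq. (11) and Eq. (12) are easy to evaluate and to sanity-check by simulation. The following is a minimal sketch, assuming an isotropic unit-variance Gaussian posterior over $w$ as stated above; the helper names and the Monte Carlo check are illustrative additions, not the authors' code.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def normalize(phi):
    """Scale the feature vector to unit Euclidean norm."""
    norm = math.sqrt(dot(phi, phi))
    return [p / norm for p in phi]

def gauss_tail(a):
    """Phi(a): upper tail probability of a standard Gaussian."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

def gibbs_error(u, phi, y):
    """Eq. (11): probability that sign(w . phi) != y for w ~ N(u, I)."""
    return gauss_tail(y * dot(u, normalize(phi)))

def gibbs_disagreement(u, phi):
    """Eq. (12): probability that two independent draws of w disagree."""
    a = dot(u, normalize(phi))
    return 2.0 * gauss_tail(a) * gauss_tail(-a)

def monte_carlo_error(u, phi, y, trials=20000, seed=0):
    """Estimate the same error probability by sampling w ~ N(u, I)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        w = [rng.gauss(u_k, 1.0) for u_k in u]
        prediction = 1 if dot(w, phi) > 0 else -1
        errors += prediction != y
    return errors / trials

u, phi = [1.0, -0.5], [2.0, 1.0]
closed_form = gibbs_error(u, phi, y=1)
simulated = monte_carlo_error(u, phi, y=1)
```

The two estimates should agree up to sampling noise, and the disagreement term peaks at 0.5 when $u \cdot \phi = 0$, i.e. when the posterior is maximally undecided about the example.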



IV. Inference and parameter estimation 

In this section, we derive the learning procedure (inference and
parameter estimation) of the proposed approach. Considering Eq. (8),
$e_S(G_Q) = \frac{1}{m} \sum_i E_{w_1} I(f_{w_1}(x_i) \ne y_i)\, E_{w_2} I(f_{w_2}(x_i) \ne y_i)$
and
$d_S(G_Q) = \frac{1}{m} \sum_i E_{w_1, w_2} I(f_{w_1}(x_i) \ne f_{w_2}(x_i))$
(Th. 2), $J(Q)$ (average cost) over the training set
$S = S_l \cup S_u$ is:

$$\begin{aligned}
J(Q) = {} & C \left[ \frac{1}{m_l} \sum_{i=1}^{m_l} E_Q\, I(f_w(x_i) \ne y_i) + \frac{1}{2 m_u} \sum_{i=1}^{m_u} E_Q\, I(f_{w_1}(x_i) \ne f_{w_2}(x_i)) \right] \\
& + \frac{1}{m} \sum_{i=1}^{m} \mathrm{KL}(Q_i(h_+, h_-) \| P(x_i, h_+, h_-)) + \frac{1}{m} \mathrm{KL}(Q(w) \| P(w))
\end{aligned}$$

where the terms $E_Q I(f_w(x_i) \ne y_i)$,
$E_Q I(f_{w_1}(x_i) \ne f_{w_2}(x_i))$ and
$\mathrm{KL}(Q(w) \| P(w))$ are given by Eq. (11), Eq. (12) and Eq. (10)
respectively; $P(x_i, h_+, h_-)$ is the abbreviation of
$P(x_i, h_+) P(x_i, h_-)$. We now show how an EM-like iterative
procedure [4] can be used to learn the stochastic feature space and the
Gibbs classifier simultaneously.



A. Inference: minimize J(Q) w.r.t. $Q_i(h_+, h_-)$

In the first step, we fix $Q(w)$, $\theta_+$, $\theta_-$, and minimize
$J(Q)$ w.r.t. $Q_i(h_+, h_-)$, subject to
$\int Q_i(h_+, h_-)\, dh_+\, dh_- = 1$. Benefiting from the explicit
bound (Th. 2), we have the following solution:

$$Q_i(h_+, h_-) = \frac{1}{Z_i} P(h_+, h_-, x_i) \exp\{-C\, E_{Q(w)}[\psi_i]\} \tag{13}$$

where $\psi_i = \frac{m}{m_l} I(f_w \ne y_i)$ if $x_i \in S_l$ and
$\psi_i = \frac{m}{2 m_u} I(f_{w_1} \ne f_{w_2})$ if $x_i \in S_u$. Note
that $E_{Q(w)}[\psi_i]$ is given by Eq. (11) and Eq. (12). The fact that
the output of the classifier appears inside the expression for the
posteriors means that the generative models are being tuned as the
classifier is being optimized during the minimization of the PAC-Bayes
bound. This tuning inhibits those examples of $h_+, h_-$ that lead to
misclassification and encourages those with less misclassification.
Sampling from this posterior is simple using Gibbs-rejection sampling,
because $P(h, x_i)$ can be directly used as the comparison function,
since $P(h, x_i) \exp(\cdot) \le P(h, x_i)$ ($E_{Q(w)}[\psi_i] \ge 0$,
as $I(\cdot)$ is a zero-one output function). Considering the $j$-th
example $h_{ij}$ drawn from $P(h, x_i)$, we reject it if
$Q_i(h_{ij}) < r_j$, where $r_j$ is drawn from the uniform distribution
over $[0, P(h_{ij}, x_i)]$.
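
The rejection rule above can be sketched in a few lines. Drawing $r_j$ uniformly from $[0, P(h_{ij}, x_i)]$ and rejecting when the unnormalized posterior falls below it is equivalent to accepting a proposal with probability $\exp\{-C\, E_{Q(w)}[\psi_i]\}$, which is what the code does. The sampler interface, the callback names and the two-state toy model are assumptions made purely for illustration.

```python
import math
import random

def rejection_sample(draw_from_model, expected_loss, C, n_samples, rng, max_tries=100000):
    """Draw hidden variables from a posterior of the form P(h, x) * exp(-C * loss(h)).

    draw_from_model(rng) proposes h from the generative model (hypothetical callback);
    expected_loss(h) returns the non-negative E_Q(w)[psi] for that proposal.
    """
    accepted = []
    tries = 0
    while len(accepted) < n_samples and tries < max_tries:
        tries += 1
        h = draw_from_model(rng)
        # exp(-C * loss) <= 1 because the loss is non-negative, so the model
        # itself serves as the comparison (envelope) function.
        if rng.random() < math.exp(-C * expected_loss(h)):
            accepted.append(h)
    return accepted

# Toy check: h is 0 or 1 with equal prior mass, and h = 1 incurs unit loss.
rng = random.Random(0)
samples = rejection_sample(lambda r: r.randint(0, 1), lambda h: float(h),
                           C=2.0, n_samples=5000, rng=rng)
fraction_of_ones = sum(samples) / len(samples)
target = math.exp(-2.0) / (1.0 + math.exp(-2.0))
```

In the toy check, proposals with high expected loss are accepted less often, so the accepted sample is shifted toward hidden states that support correct classification.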

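B. Parameter estimation: minimize J(Q) w.r.t. parameters

In the second step, we fix the posteriors $Q_i(h_+, h_-)$ and the
parameters $\theta_+, \theta_-$, and determine the posterior
distribution $Q(w)$. Instead of sampling from $Q(w)$, we directly
determine its parameter $u$ by minimizing $J(Q)$ w.r.t. $u$. Since
$J(Q)$ w.r.t. $u$ is intractable, we resort to minimizing its upper
bound $J(u)$ (see the Appendix) w.r.t. $u$. The gradient of $J(u)$
w.r.t. $u$ is:

$$\frac{\partial J(u)}{\partial u} = \frac{1}{m}(u - u_0) - C \left[ \frac{1}{m_l n} \sum_{i=1}^{m_l} \sum_{j=1}^{n} g(y_i\, u \cdot \phi_{ij})\, y_i\, \phi_{ij} - \frac{1}{m_u n} \sum_{i=1}^{m_u} \sum_{j=1}^{n} g(u \cdot \phi_{ij}) \left[ \Phi(u \cdot \phi_{ij}) - \Phi(-u \cdot \phi_{ij}) \right] \phi_{ij} \right]$$

where $g(\cdot)$ is the Gaussian density function with mean $\mu = 0$
and standard deviation $\sigma = 1$. The gradient of $B(Q, C)$ with
respect to $C$ is: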

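$$\frac{\partial B(Q, C)}{\partial C} = -\frac{e^{-C}}{(1 - e^{-C})^2} \left[ 1 - \exp\left( -J(Q) + \frac{1}{m} \log \delta \right) \right] + \frac{1}{1 - e^{-C}} \left[ e_{S_l}(G_Q) + \frac{1}{2} d_{S_u}(G_Q) \right] \exp\left( -J(Q) + \frac{1}{m} \log \delta \right)$$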
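
The derivative of the bound with respect to $C$ follows from the definition of $B(Q, C)$ in Section III-B, and a finite-difference comparison is a cheap way to validate an implementation of it. The sketch below treats the empirical risk term $e_{S_l}(G_Q) + \frac{1}{2} d_{S_u}(G_Q)$ and $\mathrm{KL}(Q \| P)$ as given numbers; the function names and numeric values are hypothetical.

```python
import math

def bound_value(C, risk, kl, m, delta):
    """B(Q, C) with J(Q) = C * risk + kl / m, where risk = e_S + d_S / 2."""
    J = C * risk + kl / m
    return (1.0 - math.exp(-J + math.log(delta) / m)) / (1.0 - math.exp(-C))

def bound_gradient_C(C, risk, kl, m, delta):
    """Analytic derivative of B(Q, C) with respect to C."""
    J = C * risk + kl / m
    inner = math.exp(-J + math.log(delta) / m)
    scale = 1.0 - math.exp(-C)
    return -math.exp(-C) / scale ** 2 * (1.0 - inner) + risk * inner / scale

def finite_difference(C, risk, kl, m, delta, step=1e-5):
    """Central-difference approximation used only as a sanity check."""
    upper = bound_value(C + step, risk, kl, m, delta)
    lower = bound_value(C - step, risk, kl, m, delta)
    return (upper - lower) / (2.0 * step)

def update_C(C, learning_rate, risk, kl, m, delta):
    """One plain gradient step on C, kept strictly positive."""
    return max(1e-6, C - learning_rate * bound_gradient_C(C, risk, kl, m, delta))

args = dict(risk=0.2, kl=3.0, m=200, delta=0.05)
analytic = bound_gradient_C(1.5, **args)
numeric = finite_difference(1.5, **args)
```

A gradient step such as `update_C` is one simple way to tune $C$ jointly with the other quantities; cross-validation over a grid of values is an alternative.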

In the third step, we fix $Q_i(h_+, h_-)$, $u$ and update the parameters
$\theta_+, \theta_-$. Note that only the third term of Eq. (9), i.e.,
the objective function of the positive model, involves $\theta_+$. So
the update rules of $\theta_+$, derived by minimizing Eq. (9) w.r.t.
$\theta_+$, are the same as those of the original generative model.
Similarly for $\theta_-$.

The learning procedure is summarized in Algorithm 1. In classification,
similar to [2], we use the majority-vote decision rule
$\hat{y} = \mathrm{sign}\left[ E_{P(h_+, h_- \mid x_i) Q(w)}\, w \cdot \phi(x_i, h_+, h_-) \right] \approx \mathrm{sign}\left[ \frac{1}{n} \sum_{j=1}^{n} u \cdot \phi(x_i, h_{+ij}, h_{-ij}) \right]$
with $n = 5$ and $(h_{+ij}, h_{-ij})$ being the $j$-th example drawn
from $P(h_+, h_- \mid x_i)$.
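
The sample-based decision rule amounts to averaging the linear scores of a few sampled feature vectors and taking the sign. A minimal sketch follows, assuming the feature vectors $\phi(x_i, h_{+ij}, h_{-ij})$ have already been computed from samples of the hidden variables; the function name and inputs are illustrative.

```python
def predict(u, feature_samples):
    """Majority-vote prediction from n sampled feature vectors.

    u is the posterior mean of the weight vector; feature_samples holds one
    feature vector per sampled pair of hidden variables.
    """
    n = len(feature_samples)
    score = sum(sum(u_k * p_k for u_k, p_k in zip(u, phi))
                for phi in feature_samples) / n
    # sign(a) = +1 if a > 0 and -1 otherwise, matching Section II.
    return 1 if score > 0 else -1

label = predict([1.0, -1.0], [[2.0, 1.0], [1.0, 0.5], [0.5, 1.0]])
```

Here two of the three sampled features score positively and the average score is positive, so the example is assigned to the positive class.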

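Algorithm 1 Inference and learning

input: data sets $S_l, S_u$; $S'_l, S'_u$ are fractions of $S_l, S_u$
initialize $u$, learning rates $\gamma_u, \gamma_C$, and $\delta = 0.05$
$u_0 \leftarrow \arg\min_u R_{S'_l}(G_Q) + \frac{1}{2} d_{S'_u}(G_Q)$
repeat
    for $i = 1$ to $m$ do
        sample $Q_i(h_+, h_-)$ using Gibbs-rejection sampling
    end for
    update $\theta_+$ with $\{h_{+ij}\}_{i,j}$ ($x_i \in S_+$) using the rules of the
    original generative model. Similarly for $\theta_-$
    $u \leftarrow u - \gamma_u \frac{\partial J(u)}{\partial u}$, $C \leftarrow C - \gamma_C \frac{\partial B}{\partial C}$
until convergence
output: $u_0, u, \theta_+, \theta_-$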



V. Experiments 

This section empirically evaluates the proposed method, stochastic
feature mapping (SFM), and related methods on general classification
tasks, scene recognition and protein sequence classification. For a
multi-class classification problem, we divide it into binary
classification problems, each of which is a one-versus-rest problem that
distinguishes one class from the others. For each binary problem, we
randomly partition the positive examples into 50% training and 50% test
sets, and similarly for the negative examples. We test each binary
classification problem on 20 random partitions, and report the average
results. For the semi-supervised version, we use 25% of the test
examples as unlabeled data. Two related and general methods, Fisher
score (FS) [3] and free energy score space (FESS) [9], and some other
state-of-the-art methods are also tested for comparison.

Fig. 1: Classification accuracy (%) for varying number of training examples.
Left: Tissue (UCI); middle: Highway (scene); right: Superfamily #2 (protein).
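
The evaluation protocol described above (one-versus-rest relabeling, a 50/50 split within each class, 20 random partitions, 25% of the test examples used as unlabeled data) can be sketched as follows. This is only an illustration of the protocol: the `evaluate` callback, which is assumed to train and score a classifier given index lists, and all other names are hypothetical.

```python
import random

def one_vs_rest(class_ids, positive_class):
    """Relabel a multi-class problem as +1 (chosen class) versus -1 (rest)."""
    return [1 if c == positive_class else -1 for c in class_ids]

def half_split(labels, rng):
    """Randomly split positives and negatives separately into two halves."""
    train, test = [], []
    for target in (1, -1):
        indices = [i for i, y in enumerate(labels) if y == target]
        rng.shuffle(indices)
        half = len(indices) // 2
        train.extend(indices[:half])
        test.extend(indices[half:])
    return train, test

def unlabeled_subset(test_indices, rng, fraction=0.25):
    """Pick a fraction of the test examples to serve as unlabeled data."""
    return rng.sample(test_indices, int(len(test_indices) * fraction))

def average_over_partitions(labels, evaluate, n_partitions=20, seed=0):
    """Average a user-supplied score over repeated random partitions."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_partitions):
        train, test = half_split(labels, rng)
        scores.append(evaluate(train, test, unlabeled_subset(test, rng)))
    return sum(scores) / len(scores)

labels = one_vs_rest([0, 0, 1, 1, 2, 2, 2, 2], positive_class=2)
train, test = half_split(labels, random.Random(0))
```

Splitting each class separately keeps the class balance of the training and test sets equal to that of the full binary problem.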

There are two points to note in the implementation. First, the
optimization procedures of $u_0$ and $u$ may suffer from local minima,
resulting in poor solutions. The strategy adopted by [2] is to perform
the optimization for 10 to 100 trials, where a new random initial point
within the range $[-20, 20]^d$ is used in each trial. Second, the value
of the parameter $C$ has been shown to be important. Another effective
strategy we experimented with is to assess the performance using 10-fold
cross-validation.

A. Deriving a general classification tool 

In the first experiment, we derive a general classification method by
applying the proposed framework to a simple yet general generative
model, the Gaussian mixture model. Let $x \in \mathbb{R}^d$ be the
observed variable; $z = (z_1, \ldots, z_K)$ be the hidden binary
indicator vector for the $K$ mixture components, with the covariance
matrix assumed to be diagonal; and $a = (a_1, \ldots, a_K)$ be the
parameters of the approximate posterior of $z$. The elements of the
feature mapping of this model are
$\{z_k (x^T, \mathrm{diag}(x x^T)^T, 1),\; z_k \log a_k\}_k$. The
posterior of $z$ can be easily derived from Eq. (13). The number of
mixture components is set to $K = 4$ throughout the experiment.
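
To make the Gaussian mixture feature mapping concrete, the sketch below assembles the feature vector for one example and one sampled indicator vector $z$. It assumes that $\mathrm{diag}(x x^T)$ denotes the element-wise squares of $x$ (which holds for a diagonal covariance) and that the per-component blocks are simply concatenated; the function name and the ordering of the entries are illustrative choices.

```python
import math

def gmm_feature(x, z, a):
    """Stochastic feature vector for a diagonal-covariance Gaussian mixture.

    x: observed vector, z: one-hot component indicator (a sampled hidden
    variable), a: parameters of the approximate posterior of z.
    For each component k the block is z_k * (x, x squared element-wise, 1),
    followed by z_k * log a_k.
    """
    features = []
    for z_k, a_k in zip(z, a):
        block = list(x) + [x_d * x_d for x_d in x] + [1.0]
        features.extend(z_k * value for value in block)
        features.append(z_k * math.log(a_k))
    return features

phi = gmm_feature([1.0, 2.0], z=[0, 1], a=[0.3, 0.7])
```

Because $z$ is one-hot, only the block of the sampled component is non-zero, so different samples of $z$ place the same example $x$ in different coordinates of the feature space.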

We select 8 data sets from the UCI database for evaluation, preferring
those with no missing entries. The number of classes of each data set is
between 2 and 15. The number of examples of each class varies from 14 to
673. The dimensionality is between 9 and 90. We compare our method SFM
with Adaboost [17], SVM [11], localized multiple kernel learning (LMKL)
(H and PAC-Bayes gradient descent PBGD3 [2]. The average results are
reported in Table I. They show that SFM is adaptive to different data
sets and outperforms the other methods on half of the data sets. It is
also worth noting that the linear version of PBGD3 does work well in
these evaluations. The results of the semi-supervised version are
presented in Fig. 1.

B. Scene recognition 

We evaluate our SFM method and compare its performance 
against comparable methods on a typical vision task, scene 
recognition. In this task, visual words are used for image 
representation for its robustness to topic and spatial variance. 

TABLE I: Classification accuracy (% ± std) on UCI database (one-versus-rest on each dataset).

DATA      Adaboost        SVM             LMKL            PBGD3           SFM-GMM
Cancer    93.24 ± 1.26    96.80 ± 1.79    96.41 ± 0.97    93.98 ± 1.52    95.02 ± 0.95
Tissue    88.55 ± 5.91    78.17 ± 12.27   87.69 ± 5.24    88.14 ± 0.50    89.04 ± 3.12
Wine      92.98 ± 3.42    97.73 ± 1.86    95.48 ± 4.10    92.22 ± 12.63   95.79 ± 1.33
Sonar     70.87 ± 4.76    73.11 ± 3.25    80.21 ± 1.52    75.52 ± 5.70    80.60 ± 4.24
Credit    84.74 ± 1.30    84.74 ± 1.79    81.92 ± 1.41    83.53 ± 1.82    84.89 ± 1.30
SpHeart   78.65 ± 2.08    74.66 ± 3.56    80.38 ± 3.40    79.70 ± 0.65    80.84 ± 0.49
Libras    92.97 ± 1.76    87.54 ± 7.01    96.58 ± 1.78    94.52 ± 2.80    95.61 ± 3.34
Steel     89.47 ± 9.08    86.43 ± 9.16    92.63 ± 8.14    87.30 ± 8.26    89.10 ± 8.49



TABLE II: Accuracy (% ± std) of one-versus-rest scene recognition.

SCENE        PHOW-SVM        LDA-MAP         FS-LDA          FESS-LDA        SFM-LDA
Coast        90.66 ± 0.65    83.85 ± 0.92    90.42 ± 0.34    93.89 ± 0.46    94.06 ± 0.64
Forest       96.49 ± 0.39    94.94 ± 0.46    94.45 ± 0.46    97.92 ± 0.26    98.10 ± 0.32
Mountain     92.58 ± 0.64    84.99 ± 1.78    88.62 ± 0.50    93.29 ± 0.47    93.80 ± 0.44
Country      91.38 ± 0.71    72.30 ± 1.74    87.40 ± 0.46    90.62 ± 0.33    90.82 ± 0.62
Highway      95.27 ± 0.49    81.50 ± 1.28    92.48 ± 0.22    94.67 ± 0.34    95.46 ± 0.30
InsideCity   93.96 ± 0.62    85.14 ± 1.74    90.79 ± 0.14    94.26 ± 0.65    95.27 ± 0.35
Street       93.89 ± 0.64    76.46 ± 1.23    93.76 ± 0.24    94.21 ± 0.42    95.03 ± 0.47
Building     94.40 ± 0.49    87.85 ± 0.55    92.83 ± 0.57    96.06 ± 0.51    95.81 ± 0.38



We use latent Dirichlet allocation (LDA) UJ to model the distributions
of visual words, and derive a recognition tool under the proposed
framework. Like [15], we sample the topic variable using collapsed Gibbs
sampling and reject examples according to the rule for Eq. (13). We fix
the parameter $\alpha$ and allow $\beta$ [15] to be updated. Let $w, z$
respectively indicate word and topic, and $\gamma$ be the parameter of
the approximate posterior of $z$. The elements of the feature mapping
$\phi$ of a model are
$\{z_{nk},\; w_{ni} z_{nk},\; z_{nk} \log \gamma_{nk}\}_{n,i,k}$, where
$n, i, k$ index word, term and topic respectively. For FS [3] and FESS
[9], we extract features from the trained LDA model and deliver them to
an SVM. The number of topics of LDA is set to 50.

The CVCL scene dataset is chosen for evaluation. It contains 4
artificial scenes and 4 natural scenes. For each image, dense SIFT
descriptors [6] are extracted from 20 x 20 grid patches over 4 scales.
These descriptors are quantized to visual words using a code book (50
centers) clustered from randomly selected descriptors. The resulting
visual words of an image are in the form of a histogram, where each bin
corresponds to a code center of the code book. The evaluation results
are summarized in Table II. Our results compare well with PHOW [16],
which is a state-of-the-art feature for scene recognition. The results
of semi-supervised learning are shown in Fig. 1, demonstrating that
unlabeled examples can help classification, particularly when there are
few labeled examples.
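
The quantization step that turns local descriptors into a visual-word histogram can be sketched as a nearest-center assignment followed by counting. The code below assumes Euclidean distance and a code book that has already been learned; descriptor extraction and clustering are outside its scope, and the names and toy values are illustrative.

```python
def nearest_center(descriptor, codebook):
    """Index of the code center closest to the descriptor (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda c: sum((d - e) ** 2 for d, e in zip(descriptor, codebook[c])))

def visual_word_histogram(descriptors, codebook):
    """Count how many descriptors of an image fall on each code center."""
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        histogram[nearest_center(descriptor, codebook)] += 1
    return histogram

codebook = [[0.0, 0.0], [10.0, 10.0]]
histogram = visual_word_histogram([[1.0, 1.0], [9.0, 8.0], [0.0, 2.0]], codebook)
```

With a 50-center code book, as used in the experiment, every image is summarized by a 50-bin histogram regardless of how many descriptors it produced.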

C. Protein classification 

To evaluate the capability of the proposed approach in dealing with
variable-length sequences, we apply the proposed framework to remote
homology recognition. The problem here is to assign test protein
sequences to the domain superfamilies defined in the SCOP (1.53)
taxonomy tree according to the functions of the proteins. The protein
sequence data are obtained from the ASTRAL database with an E-value
threshold of $10^{-25}$ to reduce similar sequences. We use four labeled
domain superfamilies, metabolism, information, intra-cellular processes
and extra-cellular processes, for evaluation. The numbers of sequences
are 804, 950, 695 and 992 respectively. Each protein sequence is a
string composed of 22 distinct letters, and the string length varies
from 20 to 994.

A hidden Markov model (HMM) [10] is used to model the distribution over
protein sequences for its ability to handle sequences of variable
length. The number of output states is 22, and the number of hidden
states is set to 10. Let $X$ be the sequence with length $T_x$, where
$x^t$ is the binary indicator with $x^t_k = 1$ if the $k$-th state of
$K$ possible ones is selected at time $t$. Let $q^t$ be the binary state
indicator where $q^t_i = 1$ if the $i$-th state of $M$ possible ones is
selected at time $t$; $A_{M \times M}$ be the transition probabilities
of the approximate posteriors. The elements of the feature mapping
$\phi$ can be written as {q',,Yj,r,Wfl'l\YJ,5,'q\q'l''^ogAi^^^^^^^ .
With the hidden states of the input sequence inferred by the Baum-Welch
algorithm [22], it is easy to estimate the posterior transition
probabilities conditioned on $x$. Using the sampling distribution
derived in Eq. (13), we are able to draw examples of the hidden states
and re-estimate their posterior. The results are reported in Table III.
The 2-gram feature is actually the transition probability of the
observed states of a sequence, i.e.
$\{\sum_{t=1}^{T_x - 1} x_i^t x_k^{t+1}\}_{i,k}$. The differences in
performance among the first four methods are not significant except on
family #3. The results of semi-supervised learning are reported in
Fig. 1, which shows improvement when there are few training samples.
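
The 2-gram baseline feature can be illustrated by counting transitions between consecutive observed symbols. The sketch below additionally divides the counts by the number of transitions so that sequences of different length are comparable; that normalization is an assumption made here, as are the function names and the toy alphabet.

```python
def two_gram_counts(sequence, alphabet):
    """Matrix of counts of each ordered pair of consecutive symbols."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    size = len(alphabet)
    counts = [[0] * size for _ in range(size)]
    for current, following in zip(sequence, sequence[1:]):
        counts[index[current]][index[following]] += 1
    return counts

def two_gram_feature(sequence, alphabet):
    """Flattened transition frequencies, a fixed-length feature for any sequence."""
    counts = two_gram_counts(sequence, alphabet)
    total = max(1, len(sequence) - 1)
    return [count / total for row in counts for count in row]

counts = two_gram_counts("ABAB", "AB")
feature = two_gram_feature("ABAB", "AB")
```

For protein sequences over 22 letters, this yields a 484-dimensional vector that can be passed directly to a standard classifier such as an SVM.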

TABLE III: Accuracy (% ± std) of one-versus-rest protein recognition.

SUP.FAM.   2GRAM-SVM       HMM-MAP         FS-HMM          FESS-HMM        SFM-HMM
# 1        78.79 ± 1.13    80.91 ± 1.53    80.03 ± 0.78    80.12 ± 0.84    82.75 ± 0.96
# 2        79.01 ± 0.97    80.10 ± 0.51    77.56 ± 0.64    78.96 ± 0.59    83.08 ± 0.55
# 3        75.19 ± 0.86    77.92 ± 0.79    73.31 ± 0.21    73.35 ± 0.41    79.28 ± 0.64
# 4        96.01 ± 0.33    95.10 ± 0.39    94.27 ± 0.37    97.58 ± 0.13    96.72 ± 0.63



VI. Conclusions 

This paper presents a framework that combines the abilities of
generative and discriminative models for classification under PAC-Bayes
theory. The bridge of this combination is a stochastic feature mapping,
which is derived from the linear form of the practical MAP classifier
and is independent of the parameters of the adopted generative models.
Under this framework, the derived stochastic feature mapping and the
generative models can be tuned during the training of the classifier. A
major difficulty is the non-convexity of the objective function, where
local minima can hamper the solution. Our approach can benefit from the
development or exploitation of more robust and efficient optimization
methods.

Appendix 

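Since $J(Q)$ is intractable, we derive its upper bound by fixing
$\theta_+$, $\theta_-$ and $Q_i(h_+, h_-)$. Using Eq. (10) and Eq. (13),
we have

$$\begin{aligned}
\mathrm{KL}(Q \| P) &= \mathrm{KL}(Q(w, h_+, h_-) \,\|\, P(w) P(x, h_+ \mid \theta_+) P(x, h_- \mid \theta_-)) \\
&= E_{Q_w}\left[ \log \frac{Q(w)}{P(w)} \right] + \sum_{i=1}^{m} E_{Q_i}\left[ \log \frac{Q_i(h_+) Q_i(h_-)}{P(x_i, h_+) P(x_i, h_-)} \right] \\
&= \frac{1}{2} \| u - u_0 \|^2 - C m \left[ e_{S_l}(G_Q) + \frac{1}{2} d_{S_u}(G_Q) \right] - \sum_{i=1}^{m} \log Z_i
\end{aligned}$$

where $Q_w = Q(w)$, $Q_i = Q_i(h_+, h_-)$ and

$$\begin{aligned}
\sum_{i=1}^{m} \log Z_i &= \sum_{i=1}^{m} \log E\left[ \exp\left( -C\, E_{Q(w)}[\psi_i] \right) \right] \ge \sum_{i=1}^{m} E\left[ -C\, E_{Q_w}[\psi_i] \right] \\
&= -\frac{C m}{m_l} \sum_{i=1}^{m_l} E\, E_{Q_w} I(f_w \ne y_i) - \frac{C m}{2 m_u} \sum_{i=1}^{m_u} E\, E_{Q_w} I(f_{w_1} \ne f_{w_2}) \\
&\approx -\frac{C m}{m_l n} \sum_{i=1}^{m_l} \sum_{j=1}^{n} \Phi(y_i\, u \cdot \phi_{ij}) - \frac{C m}{m_u n} \sum_{i=1}^{m_u} \sum_{j=1}^{n} \Phi(u \cdot \phi_{ij})\, \Phi(-u \cdot \phi_{ij})
\end{aligned}$$

where the inequality is derived by applying Jensen's inequality;
$f_w = f_w(x_i)$;
$\phi_{ij} = \phi(x_i, h_{+ij}, h_{-ij}) / \| \phi(x_i, h_{+ij}, h_{-ij}) \|$,
where $(h_{+ij}, h_{-ij})$ represents the $j$-th example drawn from
$Q_i(h_+, h_-)$. Now we have all the pieces for $e_{S_l}$, $d_{S_u}$ and
$\mathrm{KL}(Q \| P)$, and can obtain

$$\begin{aligned}
J(Q) &= C \left[ e_{S_l}(G_Q) + \frac{1}{2} d_{S_u}(G_Q) \right] + \frac{1}{m} \mathrm{KL}(Q \| P) \\
&= \frac{1}{2m} \| u - u_0 \|^2 - \frac{1}{m} \sum_{i=1}^{m} \log Z_i \\
&\le \frac{1}{2m} \| u - u_0 \|^2 + \frac{C}{m_l n} \sum_{i=1}^{m_l} \sum_{j=1}^{n} \Phi(y_i\, u \cdot \phi_{ij}) + \frac{C}{m_u n} \sum_{i=1}^{m_u} \sum_{j=1}^{n} \Phi(u \cdot \phi_{ij})\, \Phi(-u \cdot \phi_{ij}) = J(u)
\end{aligned}$$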